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Abstract 

We develop an algorithm that computes the gravitational potentials and 
forces, on /V point'masses interacting in three-dimensional space. Tile algo¬ 
rithm, based Ocl analytical techniques developed by Rokblin and Greengard, 
runs in order jV time. In contrast to other fast. jV-body methods such as 
tree codes, which only approximate the interaction potentials and forces, 
this method is exact—it computes the potentials and forces to within any 
prespecified tolerance up to machine precision. We present ail implementa¬ 
tion of the algorithm for a sequential machine. We numerically verify the 
algorithm, and compare it& speed with that of an D(jV s ) direct force "com¬ 
putation. We also describe a parallel version of the algorithm that runs on 
the Connection Machine in order Q{log .V} time. We compare experimen¬ 
tal results With those of the sequential implementation and discuss how to 
minimize communication overhead on the parallel machine. 
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Chapter 1 
Introduction 


.’vjmericat simulations of A r -body i i'LC,^ra.<: L loiss are extremely important in 
astrophysics, plasma physics, fluid dynamics, and molecular dynamics. To 
accomplish such a simulation by directly evaluating lhe interaction potentials 
requires computation time on the order of 0[N 3 ), which is prohibitively 
expensive for large A r . Consequently there have been many efforts to develop 
algorithms whose complexity grows more slowly (Lee72j. These algorithms 
aL make assumptions about the distribution of part idea to approximate the 
interaction potentials, and thus arc applicable only to certain classes of A r - 
body interactions.. 

The best- known “fast* algorithm U the tree-code [AppShi, which uses a 
tree-like data structure to form hierarchical clusters of particles. The algo¬ 
rithm approximates the force exerted by a cluster on a distant particle as 
the force exerted by an equivalent mass located the duster's center of mass. 
The overall computational complexity of this algorithm is believed to he 
OldVIog .'V). However, the force approximations arc effective when the par- 
tide distribution is relatively nonuniform, and there are no known a priori 
estimates oF the accuracy of these approximations, although methods have 
been developed for Use in practice [PorS.i] [BHS6J. 

ftokhlin and tarcengard [GR&6] discovered an A'-body algorithm with 
complexity Q{N), The algorithm separates Lhe computation of the fax-field 
potentials from that of the near-field potentials, and expands the fan-field 
potentials into Taylor series. Significantly, given a required precision e, one 
can determine in advance how many terms must be retained in the Taylor sc¬ 
ries expansions, and thus one can control the accuracy of the approximation. 
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Rokhlita arid Grccngand’s approach relies heavily upon analytical techniques 
la develop the series expansions, to derive error rati mates, and to shift ae¬ 
ries expansions between coordinate frames. For the jV-body prohlum in two 
dimensions, this analysis is greatly simplified by modeling two-dimensional 
spare as the complex ptaae and using the theory of functions of a complex 
variable. 

't his thesis applies Rokhiin and Grecugardk idea to develop an 0[N) 
algorithm for the three-dimensional iWbody problem. Although there is 
no direct analogue of complex analysis in three dimensions, we are able to 
exploit the harmonic nature of the gravitational potential in order to produce 
suitable series expansions. We implement the algorithm and experimentally 
Verify its accuracy and its linear growth. We ako implement a parallel version 
of the algorithm and present experimental results. 

Chapter 2 of the thisifi reviews Rokhiin and Greeugardk algorithm and 
their use of complex variables for the two-dimensional problem and points 
Out difficulties in applying the method ill three dimensions. 

Chapter 3 develops a three-dimensional analogue of Rokhiin and Green- 
gard’s method, The basic Idea is lo expand the potentials in terms of spher¬ 
ical harmonics. The analytical basis of the algorithm includes two theorems 
that derive the required expansions arid give hounds on the error terms. 
There are also three lemmas that describe how to shift these expansions from 
one origin to another, t.sing those analytical results, wc present a sequential 
algorithm in three-dimensions. Independently of this work, a similar exten¬ 
sion of Rokhiin and Greengardk method has been carried out by (Jceengard 
(GreSTj, His Lechnique differs from ours in that he uses expansions in polar 
Coordinates and we work in Cartesian coordinates. 

Chapter 4 concerns numerical verifications of the three-dimensional al¬ 
gorithm. We first describe a sequential implementation of the algorithm. 
Wc discuss our choice of a tree-shaped data structure and describe heuris¬ 
tic methods for increasing the efficiency of the implementation. We then 
present experimental results la demonstrate that the algorithm does have 
linear growth and does compute the forces and potentials to within any pre- 
Speeified tolerance up to machine precision. We compare the Speed of the 
algorithm with that of ern 0 (jV j ) direct force computation. For a required 
average accuracy of Id - ' 1 for potential fields, our algorithm will be raster Elian 
the direct force computation when there are more than LOCO particles. 

Chapter 5 presents an extremely fast implementation of the A r -body 
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cods—ft parallel implementation of the algorithm, which mm in O(iogjV) 
time, implemented; on the Connection Machine. We compare the experi¬ 
mental results with that of the sequential implementation and find that the 
parallel algorithm baa a speed -up of 10 for N ~ I, HOC and a speed-op of 100 
for A" — 10,000- For ft large parallel machine, interprocessor communica¬ 
tion is often the bottleneck of the computation. By exploiting regularity and 
locality in communication patterns, by combining and delegating messages, 
we demonstrate how to minimize communication overhead due t.o message 
congestion. 

Throughout this thesis, if not specified otherwise, jY will always refer to 
the total number of particles in a simulation, and p wj|] always refer to the 
highest degree of harmonics retained in an expansion. 


Chapter 2 

The A'-body Algorithm in Two 
Dimensions 


This chapter follows the presentation of the two-dimensional algorithm given 
by Rokhlin and Grccngard in [GH8(5]. We review the major theorems and 
give a heuristic description of the algorUhm, The key idea is to use complex 
analysis t.o produce a series expansion of the potentials, 

2.1 The Multipole Expansion Method 

In the two-dimensional jY-body problem. we consider point charges and 
Coutombic forces. For a charge q tocaLod at the point y — {yi , ^2) £ if J , 
the potential at a poinL 1 = (i lr J2) is q log || x — y f| r If we identify R 2 with 
the complex plane C 1 via (-ii, ■£;) ft fi + we can consider the potential 
to be the real part of the analytic function : O 1 —>■ C T] 

M*) — -S lo«(x - if). 

We caEi expand <& v (x) in a Taylor serins that converge? for any $ with 

I 1 ! > b\* 


= -glogfa: - y) 


<t log(je) + log - ~ j] = q 


logfj) 



k - 


If we retain only p terms of the expansion, the error is bounded. 1 
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F+l 


This observation leads to the following muitipole eTpanjtdit! Given an 
ensemble of«t charges {^,i = 10 Cited at points {pM, i - 1, m}, 

w, th |yh)| < r fl> the potential ^(o(je) at any point x € C 1 due to the charge 
fl ( '* at with |jf| > e ai is 


— q 


(M 


^ 1 (V°) 




( 2 . 1 ) 


I he total potential at the point ;e € C ] due to all the charges, therefore, is 


= S^>( a ) = QIos(jc) ■+ £ - 4, 


JL —3 


(2.2) 


whore 


v=E« M «id «*-jr 

i=1 


-# Q^)' 


57 “ 1 


Moreover, Ibr any p > 1, 


M*) ~ Q log(a) - Z 77 

Jf =1 * 


where 

HI 

* - E k (0 l- 

i=i 

In parti on Ear. Lf |r 0 /ij ^ 1/2, we have 


a t 

c' ^ 


|r 0 


1 - 


1 I 



X 



F+l 


®(x'\ - t^logir) - ]P 


*=i 
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To obtain an approximation accurate to within a. given precision f, it suffices 
to retain only the first p terms in the multipole expansion (2.2) aild to take 
p to be of the order | — log^e]. 


The ability to sum individual expansions of (2.1) to obtain the multipole 
expansion of (2,2} depends on the fact that these expansions i — 

have a common reference point and a common region of conver¬ 
gence. If they' don’t, we need to shift the series {^11(4),t = I,..., m} to 
a common reference point. Also, to insure that af] the shifted series have 
a common region of convergence, we may need to : 'fljp" a series so that its 
region of convergence lies inside a disk, rather than outside a disk. Rokhlin 
and Greengard [GR$t3| derive shifting and flipping lemmas that describe how 
to perform the.se operations Eu time independent of the total number of par- 
tides, These lemmas allow us to manipulate the multipole expansions in a 
manner required by the following fast algorithm. 

2*2 Description of the Algorithm 

RokhUn and Gresngard ! s 0(A r ) algorlc Llul coinputes separately the potentials 
of close-by pendicles (“near-held potentials' 7 ) and the potentials of far-away 
particles (“far-field potentials 711 ). The near-field potentials are computed us¬ 
ing direct evaluation, ff the number of particles near any given particle is 
bormded by a small constant, that is, if Lhc distribution of particles is rela¬ 
tively uniform, then the work required to compute the near-field potential at 
this particle is of order CJ{1). The far-field potential is obtained by evaluat¬ 
ing the p-term multipole expansion described above at the particle position, 
which takes constant number of operations for a presped fled p* The com¬ 
putation is organised so that the total amount of work for computing the 
potentials at all the N particles is of order O(N). 

Mote precisely, the two-dpmendonal space under consideration is regarded 
as a square (Figure 2-1), It is divided into four subsquares, and each of which 
is then recursively subdivided into four sub-subsquares, and So on. This de¬ 
composition Ls represented using a tree-like data structure in which each 
square is represented by a node. A square at level 1 of the tree has four child 
nodes that represent the subsquares at level i + 1. This recursive decompo¬ 
sition is continued until there are only several particles in a square at finest 


CHAP TER 2. THE N -BODY AL GORITHM IN 2- D 7 




Figure 2 1: Decomposition in a two-dimensionaE spirt, with the cotrtspo nding 
data structure 

grain. 1 nder the assumption that the particles -ate uniformly distributed 1 > 
the level of the tree m, i.e. s the level of the decomposition, is usually dloson 
as about [log., ;VJ, Therefore each square at the finest level has in average 
one or two particle in it. 

We define for a square its near-field, far-field, and tRfenadiue-jieAf. The 
near-field of a square consists of those neighboring squares that are at the 
same Jove] of decomposition as the square. In Figure 2.2, the squares labeled 
as B are in the ne&r-ficLd of square A. The far-fiekt of a square is defined to be 
the exterior area of its near-field. The interactive-field of a square is the part 
of its far-fiekt that, is in the near-field of the square’s parent. In Figure 2 , 2 , 
the squares labeled as A 1 are in the interactive-field for the square A, 

The goal of the computation is to compute the far-field potential expan¬ 
sions at all particle positions with 0[N) work. This is achieved by propa¬ 
gating values, first upwards through the tree, then downwards, as follows; 

Initially, for each of the squares at the finest level, we compute the p-term 
multipole expansion, valid outside the square, for the potential due to the 
charges inside the square. The expansion is performed relative to the center 
of the square. 

■ RokhJiti [COTlB 7 ] has an Of W) adaptive- form (?f the algorithm which does nut depend 
fur ili pflifonnante an the particle djctributhDOft, 
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Figure 2.2: Near-field, far-field, and interactive-field 


These expansions are propagated upwards through the tree to produce, 
for every square at every level, a mult ip ole expansion for the potential due 
to all charges located inside the square. The valid region of convergence for 
the multi pole expansion is the far-held of the square. To form the multipole 
expansion for a square at level f, we Lake each of its LeveH-l-1 aubaquaiea, shift 
their multipole expansions to the center of the level-1 square and add them 
together. This is done recursively at each level of the tree while propagating 
upwards through the tree. 

The downward-propagation phage of the computation produces] for every 
square, a local potential expansion due to its far-field interactions. The 
local expansion for a square is a multipole expansion that is valid inside the 
square. The computation proceeds recursively, Suppose wo already have 
a local potential expansion $aM for square A's. parent .4p. The local 
expansion tp *M for the square A is, by the definition of an interactive-field, 
the sum of and those multipole expansions due to particles; in A'z 

interactive-field. However, is relative to the center of ,4p. In order 

to combine it with other multi pole expansions for the square .4, we shift 
to the center of .4- We have, from the upward-propagation phase of 
the computation, the multipole expansion & 4 '(i) for each square ,4 r in 
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interac tive-field, Id order to combine we shift them to the center 

of 4, and flip region* of convergence so that they have a common reference 
point and arc valid inside the square A. The stum of the shifted (i) and 
tarO) ia the local expansion ^(i) for the square A. This computation is 
performed recursively all the way down to the leaves of the tree. 

New we have for every square at the finest decomposition level the local 
expansion due to its fat-field valid in the square. To get the far-field po¬ 
tential at a given particle, we have only to evaluate the toad expansion at 
the particle’s position. The near-field potential is obtained by evaluating di¬ 
rectly interact ions with particles in the near-field. The sum of the near-held 
potential and the far-field potential is the desired potential a". Lhe particle. 

Rokhlin and Grewngard JGRSG] show that the total work required is 
+ bp + ck n + cf)p where £■„ is the upper-bound for the number of par¬ 
ticles in a square at the finest level. The key observation in this estimate 
is that each shifting Or flipping of a p-terrn imntipole expansion takes Q(p 2 ] 
work, and that the number of squares m a given square’!! interactive-field is 
bounded by 27, The total number of squares at all levels is 


4° + 4 1 + 4" 


4 TV+X - 1 4jV - 1 
3 “ 3 


Therefore the upward-propagation phase and the downward-propagation 
phase of the algorithm accounts for the p 2 term ra the estimate. The initial 
expansion step and the final evaluation step in the algorithm is responsible 
for the p term, and the direct computation on near-field potentials is for the 
K. term. 

The overall computational complexity of the algorithm, from the above 
estimate, is of order Q(A'). 

Rokhlin and Greengard [GR&fii also derive that the error estimate in t.he 
computation is C(l/2) f+l , Given a precision requirement e, the number of 
terms p retained in the expansion LS Set as [- bg 3 tj. 


2,3 Remarks 

Tire key to the two-dimensional 0[N) algorithm is that, for the potential <j>[x) 
due to charges at ; = 1,,,.,m}, wc evaluate the potential at each of the 
points by applying the multipole expansion of <p(r), rather 
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chan by directly evaluating and si_i.rnffi.Lisg tlie individual potentiaEs. at 

each x^K Our ability to apply this method relies on the fact that the Taylor 
series expansions of the are absolutely convergent power 

series whose coefficients are Independent of x. We produce the expansion of 
o(z) hy summing corresponding coefficients of the expansions In 

order to do so, these expansions must have a common reference point and a 
common region of validity, or they can be shifted so that they do so. 

The applicability of complex-analytic techniques in the two-dimensional 
case results from the fact, that the two-dimensional potential, viewed as a real- 
valued function on IP* is harmonic, and thus is the real part of a complex- 
analytic function on C. In three dimensions,, the potential function is likewise 
harmonic. l.In Fortunately, there are no corresponding complex-analytic tech¬ 
niques in three dimensions to simplify the analysis, 

In the next chapter, we will produce expansions of the three-dimensional 
potential that are suitable to support the multipole expansion method- 


Chapter 3 

A three-dimensional iV-body 
algorithm 


W* consider low the three-dimensional jV-body problem, where the gravita¬ 
tional potential at a point x e -ft* due to a point mass ^ located at y £ Fp 
is 


^cr(i) = — 


il*-ylT 


(3,1) 


Coulomb 1C potential due to a point charge q has the same formula, except 
that ,u is replaced by — q [Go!59> [Am78]. 

In order Lo apply the multipole expansion method, we must expand this 
potential as an absolutely convergent aeries whose coefficients are indepen¬ 
dent of X, and we must produce an a priori hound on the error when Wc 
retain only a finite number of terms in the series. We must, also be able to 
shift the series expansion to a new origin. 

The potential Function in (3J} is harmonic in everywhere other chan y 
[Kos64j. This suggests the possibility of expanding ^far) in terms of spherical 
harmonics [HobS4). 

In this chapter we prove two theorems that derive the required expan¬ 
sions and give bounds on the error terms. There are also three lemmas that 
describe how to shift these expansions from one origin to another. Using 
these analytical results, wo present an analogue of Rokhlin and Greengard’s 
algorithm in three dimensions. 

At the same time the author derived the results [ZhaS-Tj, Greengard also 


II 
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extended the two-dimensional algorithm to three dimensions GrofiT]. His 
method is similar to the one given here, but the underlying theorems and ex¬ 
pansions are different. Greenyard works in terms of polar coordinates „ while 
the formulas and derivations here are done in terms of Cartesian coe.rdinat.fla. 

In Chapter 4, we present experimental results that demonstrate the er¬ 
ror bound and order-of-growth estimates given here. However, there are no 
comparable published experimental results for the polar-coordinate form oF 
the algorithm. It is unknown how the two different, forms of the algorithm 
compare in practical iTnpLflmflntationa. 


3,1 The Multipole Expansion in Three Di- 
mens ions 


Given an ensemble of particle masses {p^,Ti — m} Seated at points 
£ ff J ,n = the gravitational potential at any given point 

x e IP is 


«(*) = E 

■nal 

Here pt»l = || x || L 

absolutely convergent series 




x — 


_ A 

,d™] 


(3-2) 


-i=rl 


Wc will show how to expand l/H" 1 into an 


n=±\. r .,,rn, (3.3) 

r ,}k 

where Qijjt,^ 7111 ) is independent of t and depends only' on This will 

enable us to get the multipole expans ion for the potential 4>(Vl dLte to all t.he 
masses {^} by summing together the weighted by the masses 

^1^ respectively: 


®M = E E (-f'" 1 ) MS'*"’) 9 ,,„(*). ( 3 . 4 ) 

^jJfe \n=!l / 

The force field is obtained by taking tfie gradient of the potential field 


F 


-V$[x) 
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= E (£ {V"’) <w(» w )) l-ve„,w) 

ijfc Vn=] / 

= z(± (->‘‘0 »;»:»"''>) 0U(*)- (3.5) 

ijt '..tl=T / 

THe series expansion of (3.5) has the flame coefficients as that of (3,4), only 
this time, 6,^(i) is replaced by This enables tis to use''the same 

expansion formulas for potentials derived in the next section to compute 
rnuUipole expansions for forces. 


3*2 Theorems 

This section derives necessary formulas for producing and shifting multi pole 
expansions in three dimensions, together with bounds on the error when 
retaining only a Unite number of terms in the expansions. 


V r 



Figure 3.1- Gravitational field 

Figure 3,1 illustrates the quantities needed in our derivation. The poten¬ 
tial at point x £ R } is due to mass fi at po mt y g K 3 , j/° > j E the reference 
point, and 7 is the angle between vectors y^y and ^$1, Also there are 
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quantities r — 1| ■£ — y || f To ^|| y — y 1 - 0 - 1 || i &nd =|j r — 11- Wc need the 

following relationships later in deriving the multipolc expansion farmuSa. 

From the basic: triangle relationship, we have 

r - \j~BP + r| - iiftno cos 7, {3,6) 

We derive the following theorem which produces an expansion for 1/r in 
a region of analyticity outside a sphere. This is essential in obtaining the 
multipole expansion {3.4), 


Theorem 3.2.1 Using the quantities g-t shown in figure 3,l f if R > ?‘q, 
that is r if t = (sFipiapSa) measured reintire to is outside the sphere of 
radius fp tenHered at tr^, then 


u:h ere. 


1 _ “ d'+>+* n\ 

i+i+ti 


sad y tl V'i, vs are the direction cosines of 

[f me only terms of t + j + k < p then the error satisfies 


<c 


n> 

R 


p+t 


where C is independent of p. 


(3.?; 


(3.8) 


(3.S) 


Proof. 

In Figure 3.1, the Cartesian coordinate distance betivcm the point .9 X 
and y Is 


=|| i - y || = 


S [t*fl “ Iff) “ (.VS " Iff)]' 

) P=t 


{ 3 . 10 ) 
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We expand 1/rfr, y) ag a Taylor serins in powers of (yp - y ^) and get 


t 


r(af s |f) r{® 3 s) 


it=rPi 






K*» v) 


v=* a > 
(3.11) 

I Ilf! subscript y = yW following the brace indicates that after the differenti¬ 
ation is done yi, 513 , 5 k of point y should he replaced by of point 

jf^V When r lies outside the sphere of radius r 0 centered at pW, the above 
Taylor series converges absolutely and uniformly, we make the substitution 
¥ = y (0J before the differentiation is performed. Using the identities 

Gl^vi) = "si* Gtevi) f , = 1 -2.3 


%J0! 


and 
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H^v) 

the Taylor series (3.11) becomeg 
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(3.12) 


Let s/g — -M ffQ. Since J'i, r'a are the direction cosines of trfftiv. it 

follows that 


" 3 

l .3—1 


- E 


ct! 


♦+>+*=*«. 

Substitute this into (3.12), we get 
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iA t-4 \4‘ 


<F 




„i+j+i 


(s) P'»> 


(3.13) 


This completes the proof for (3,7). 

In order to derive the error bound of (3.9), wc will use the fact that lfr 
can He expanded in powers of R using the Legendre Polynomials [And85], 
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< 1 


£ ^fr^fc-osT) if ^|>1 


i j ~ r ^ b 


The polynomials fa (a;) are called the Legendre Polynomials 


ie 


(3.15) 


fW*) = 1, P,{x )a -(Si 2 - 1), •■• 

The series of (3.15} is absolutely convergent. Notice tha t the Legendre 
Polynomials P( T (oi>S 7 } vary with the directions of the vectors and 
and thus is a function of the point z at which the potential is to be evaluated s 
as well as a function of the mass location jr. 

Now we can complete the proof of Using the notation 


d" {= d 


\$=l 


dzi- 


the series of (3.12) becomes 


\ = B ^ E 


(-I) ( 




&* /I 


^iirj R 


m-] 


a\ *drg 


Cs)- 


(3.15) 


Notice that there is a term-by-term correspondence between this series and 
the series oF (3.15), By equating terms, we have 


P a ( cos T )=(^iy 


r ** 1 d* y i 




dr* 




(3.17) 


£F L 

Surprisingly, -j—cr- 7 :(—] only depends on 7 , not R. 


a! ftr^R' 


Using the correspondence between (3-16) and (3.15), we have 

i+J+£<P 


M = 


1 _ fl_\ 
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Since the Legendra polynomials have the property of |J»*(eoi 7 )| < 1, the 
error therefore lh 
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E 


i-=P+i 





IHj 
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fH-l 


This completes tile proof. 


Using the results, derived in Theorem 3,2.1, we can produce the muEtipoEc 
expansion of (3.4) in Section 3,1. As in Section 3.1, suppose that all m 
par tide masses are located within a spherical region with radius r 0 centered 
at ir (0> T that k, 1 1 yH — 9 W> || < r a , n = I, ...,rn, we have from Theorem -3.2.1 
the series expansion of I/r*"> converging for any x with J| £ - yW ||> r 0 



nj.jt =£, ' lk 




and the error estimate 


1 +>+L<r Q,± J+k 

( 1 ^ 

< C 

ro 

rin) 


1 

X - || 


The multi pole expansion of (3,4), therefore, is 


n — 1 ,m. 


<Df£.l = > Qiit-:-, , _ 
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(ii * - *< o > n) 


( 3 . IS) 
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where 

IK 

“ij* “ S - 

n=l 

Furthermore. 




where 


j+j+*-<p 

E q fj t 

tj.fc 


■ -L: 


( ] ) 

C /" Tj 

I'D 

\|| a-yW || j 


II 1 - y <a> II 




m. 

C' = E C/ n K 

hr 1 


The following theorem produces an expansion for L /r tn a spherical region 
of anaJylicsty, which is the basis for hipping the region of analyticity in 
shifting^, as required by the fast, algorjthrn,- 

Theorem 3.2.2 Using the humilities in Figure ,5. j again, if R < i-Q r that is, 
if i = (.ii, arj, 1 $) measured relative to y { - a> ite.s inside the sphere of radius 
rente-red at (Assn 

- = E (3,20) 

r iJJt-e 

tiffttre 

(_iv+<+‘ uj iw uj 

fci» = ■ [3.21) 

T b *v — L- ;:=M . j —Ij 

a _ / -J \ (i+j+^-Q-^-T)l 

\i + J + k - a - 3 - 1} (t - a )1 (j - 3)!(Jt - 7 )! 

*= 

fte ft i r i 4 a-^> ;-''i are the directio n cosines of 

The error bound for truncating ierms of i -r j -r k > p satisfies 
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Pro of. 

Using the relationship in {3.0), We have 

1 _ 1 

r 


'i? 2 + rj - 2 /Jr D cos 7 


1 






— + 1 = 2 —«h 7 

^rW r D 


(323) 


Denote l>y_^vr 3l -r 3 tfoe direction cosines of yl 0 !** and z/ E , ^1,^3 the direction 
cosines of yWijr. Then 

™ 7 = 7] iv( + + r 3 t^ 


where 7 is the angle between y^x and yW^. This leads to the equality 

2— CO z 1 = +j^ + jj: + xtvi + 

-■ r a/ n> rjj T 0 

If we introduce the variables 

e 1 — —= — and 

^ i-q r 0 

* = k (*i - 2n ) * ,3 = h (*a - ), 7 = *l( fa - 2iC3) , (3.24) 

equation (3.23) becomes 

L _ J__J_ 

r ^ y'l + (a+'^+ 7 }' 

We cao expand this as a TayEor series 

-“ilW) ( “^ +Tr ’ < 3 ' 25 > 

that is valid fbr |tt + $ + 7 ) < 1 . But 

(Q +,j + ,r= £ 
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thus 


1 OCf 

= - £ >S*«W 


(3.26) 




v.J L-urc 


r- _/ -J \ (l+j + t)? 

1,1 \i + j + k) ilj'.hl 


Substitute ot,f3.*f with {3.24}, the Taylor series of (3.26} becomes 

where 


(3.27) 


W liJ UJ 

'teD ff=0 it—D 

(’ a n ) f 3 3 ) (*; 7 )e»n i -“(2.^-*(2, s )*-* 

Write (3.27) explicitly in powers of 37]-S?. &nd £ s 






where 


. ^ 

11 - —-— 


s+j+4-M 1 
r (J 


As in Theorem 3-2-1, the error bound for truncating terms of x + j 4- fc > p 
in, (3.20) satisfies 

|^| p+1 

M <0\- 

ITQ 


In the following lemma, we produce a formula which shifts the center of 
A multipole expansion valid in a region of analyticitv outside & sphere. 
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Figure 3.2: Lemma 3 2.L 


Leixim-a 3.3.1 


Given a muUipole expansion 


*<*) 


Z 


&,P,*(WO 




ty*+$+7 

dxfd-4dx£ 



{3.23) 


with respect fr> 3/ 1 ' 0 * 1 ► tii'zW in the region outside the Sphere of radius r$ centered 
at y i0 \ suppose we shift the. reference point y t0) to ^ with || yW - if™ || = 
Ar, 08 Shown in Figure &£. Note that /T =|| af — yM ||, The new muUipolc 
eipmSMU of (S-2S), with respect to y' 1 , m the region outside the sphere of 
mdius ro + Ar centered at y^ 0 - 1 is 


*m - e 


c fjfc 


^+J+^ 


dXidXld 


3(1) 


(3.29) 


where 


i i * 

!C f 

n=il 3=C>>sO 


(3.30) 


Here X — (Ai. X? t X$) are the Cartesian coordinates of the point x relative 
to y |L ', and ere the coefficients in the expansion for \ fR with respect to 
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The- error bound for retar uiu_(j ten™ of i -hi + k < p in {3.29} satisfies 




| r a + £r 


R 1 


p+i 


{3.31) 


Proof, 

Using Theorem 3.2.L we expand 1 //£, with respect to jf 1 ^, in the region 
outside the sphere of radius aIf centered &t as 


1 ^ , 3‘ +J+t /1 \ 

R ~ tMo^dx\dx*dx$ \r0 # 


(3.32) 


where 


C = (-iy+ t+k ~-J. - rJrir*. 


L r i 4 2 a ■ 


arc direction cosines of y'Wyt 0 ). Notice that after the shifting there 


IS 




i = 1,2,3 


Therefore 


3* f dXi ‘ 

Substitute (3,32) into (3,28) 5 we have 

to fla+tf+'r oo f | \ 

= ,J}u‘ ,lk dXtdXidX$ \SJ 


that 


j& 


as yj 


^(■^") 2J 2Lr ii ytH-i a y*'-** ( Or ) 11 

ai ^=o rj r t=u yA, o A, aA 3 Ait / 

which, is valid outside the sphere of radius r n + ir centered at i/W. Make 
substitutions of er + ? = t 11 , 3 + j = j\-y + le - £ J . wc get. 

“ ni'+i r +A ,r / i \ 

* (Jf> = l ,j& m0 erfl '»x{»x(ox? (ff) 
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where 


f i J 


Ci'S-V “ 5^ 5J 51 (1 ^CH a i J -Q I j'- 1 L^r_ J ,r 


= 1? ^t»D Ti-=£| 


So far we have proved t-he expansion (3.29), 

Since the muitipole expansion of (3,29) Is unique in terms of the deriva¬ 
tives of 1 fR\ we could estimate the error hound for truncation in (3,29) as 
if We expanded it directly with respect to From (3.19) we know that is 


|f| < C f 


iro-hir 

I R 1 


p+] 


We now show how to convert a multEpole expansion into a local expansion 
valid wjttun a spherical region of analydeity, by shifting the reference point. 
1 he region of anaJyticity is flipped. 



Figure 3.3: Lemma 3.2.2 


Lemma 3.2 .2 Given a m 


e expansion 






nj?,f=0 


dtfd4&*! (ii) 1 


(3.33) 
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to y { ‘ a \ valid m the region outside the sphere, of radius r^ centered 
at suppose this time the reference point is shifted Ar to with 
Ar > r$, as shown in Figure 3.3. The new multipole, expansion of ($,33), 
with reaped to j^ a ) p Shis time inside a sphere of radius (Ar - Fg) centered 
at jjf'W is 

00 

<S>(X)=Y,<**XixiX* (3.34) 

ijk 

whe re 




1 


OO 


]C + *)!(i + £)!(* + T)! 

<xffw=0 


(3.35) 


Again X = (X^X^X*) are the coordinates of the point x relative: ta , 
tifji are. the coefficients in the expansions for If R with respect to 
The error bound for retaining terms of i + j + k < p in (3,Sf) is 


M < t T 



0+1 


?n 


(3.36) 


Here K «| r - [|. 


Proof. 

From Theorem 3-2,2 we can expand 1/J? S with respect to y* 0 ), inside a 
spherical region with radius (Ar - ro) and urn ter y( 01 as 

4“ E twA'|.YJX‘ (3.37) 
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— If 


{i+j+k~ct-&- 7)! 

a - a)f u - m* - t)i 


(3.33) 


where 
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T uTti aae direction cosines of Substitute (3.37) into (3.33) 


*(*) 


^ ^O+iJ+'f / <50 ' 

ewiar ? djf g Lj^n 


■Jg oc* ota -3? 

5 Z ^ S ^J £ 

*,tf,T=0 i=y 




yi-d Vj*"^ V k —Tf 
A 1 - a 3 


(3.3!}} 


The region of anaiytidty Is the inside of the sphere with radius {Ar - r<>) and 
center at y^°h After maiing substitutions of i-a = i* t j = j\k~ 7 = 
we get 

'hivY) = ]T ^rj^rA'l'Y^.V/ (3.40) 

where 


dvt' 


J 


0 O 


X] a <*Bt +*.t r 4 a.v+^tG* + ct )3{ f ■+ 0 )!( k 1 + 7)!, 

Cip?iTT=0 


(3,41) 


This completes the proof of (3.34). 

Sim.i.ar to Lemma. 3.24 h ihe error bound for retaining only terms of r + 
3 + k < P in (3,34) is 


M < & 


R' 


jj+i 


Ar — r 0 


We also derive a formula for shifting an focal expansion within a spherical 
region of anaiytidty. 

Lemma 3.2.3 Given <t local expansion 

$(i)= ^ 

iJJ*6 

with n mpett to sr* c \ valid tn ffte region inside the sphere of radius r Q centered 
at we expand *(apj inside the sphere, of radius. r 0 - At centered at i/M, 
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Figure 3-4’ Lemma 3. 2 .3 


after- is sj’ujfod Ar to y'W with Ar < T( } as shown in Figure $.j r as 

*[X) = £ d^X{X^X] 

twfcfo X = (Ac neir coordinates of the point x with respect to y' l< '' , 

and 


C r) e :")(*: 

Here y'^ — |/°) = (Aa^. Axj, 


- Y e i+o- I j+j3,Jt+-v | 
aji.-ymO 


Ax^Aj*Ax3 (3.42) 


Proof. 

This can be derived by re-expanding the expansion 

■20 

^(X) - ^ + Aj^^Xa + Aj2) j (X 3 + ijjf 

u-fc™i> 

in powers of Xi T Xa n AY No truncation errors are introduced in the calcuia 
tion of (3.-12). 
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Having derived the potential in the form of the multipole expansion 


we retail from the equation (3,5) of Section 3.1 that the force is obtained by 


where 


Therefore, 


? = 


3 d 
3xy fcS 




rej 


0**1 


r xi 


2 -t a , 


- £ je^ar^ 

T.|J.t=l> 


-l,t 
T st 


T | 3?a 

i^j,.fe=a 


3.3 A Sequential Algorithm 

In this- section we give a complete description of the tiirce-dinnetisionaL jV- 
body algorithm. This algorithm is similar to the two-dimensional algorithm 
of Rokhlin and Greetlgard! described in Chapter 2, however this time, we use 
the analytical results obtained in the previous section. This formulation of 
the algorithm 3S for a sequential computer. We present a parallel version in 
Chapter 5. 

The three-dimensional space under consideration is taken to be a cube 
{Figure 3.5). We apply the operation of subdividing a cube into eight identi- 
caJ ,subcubes repeatedly, until a subcube has only a few particles ] Q if, When 
Ilie particle distribution is relatively uniform, the level of subdivision l r ap . 
proximaftety n s []og s A r J. By convention, we say that the initial cube is 
at level o, and the atomic cubes, i.e., cubes at the finest level of the spatial 
refinement, arc at level n. As in the two-dimensional algorithm, we define lor 
a cube its near-fieEd, far-fidd and interactive-field, The near-field of a cube 
is defined as consisting of those cubes that are At the same level as the cube, 
and have distance to the center of the cube less than of the cube edge 
length; the far-fidd of the cube is the complement of its ucar-field and itself: 
and the interactive field of the cube is the part of its far-field contained iii 
its parents near-field, 
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Figure 3,5: Decomposition in a three-dimensional space 

For a cube t at level /„ we denote by o[{i) the umltipole expansion, valid 
it) the tar-field of the cube, for the potential due to particles in the cube; and 
by ^l(ar) the local expansion, valid inside the cube, for the potential due to 
those particles in its far-field. Both and are expanded relative to 
the center of the cube. 

A description of the algorithm is given in Figure 3.S. n is the total levels, 
of spatial refinement. 

Step (1} computes initially' the multipole expansion o*(i) for each atomic 
cube. This is obtained by adding together series expansions, relative to the 
center of the cube, for potentials due to individual particles ill the cube. 

Step (2) computes the multipule expansion ${i) for each of the cubes at 
all intermediate levels in an upward manner. For a cube i at level I, the 
multipole expansion is computed by summing £ J+I (r) of cube i's eight, 
subcubofl that are shifted to the center of the cube i. 

Step (3) computes the local expansion for each of the cubes. For 
a Cube i at level J 1 , the computation consists of two parts. First it shifts 
tic - (tf) of cube i"’s parent to the center of cube j', which constitutes the far- 
ftekl interaction. Then, it computes interaction with its interactive-field by 
summing of the interactive-field that are shifted to the-Center of cube i. 
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The mm of these two parts is the total expansion $t{x) for the cube. This 
is recursively performed when walking down the refinement tree. 

Finally step (4) computes the desired potential for each of the particles. 
The far-field potential is obtained by evaluating ^(i) of the atomic cube 
the particle belongs to at the particle position. The near-field potential is 
obtained by a direct computation. 

The error estimate for the algorithm is given by (4,31) and (3.36). By 
the definition of the near-field and far-fidd for each of the cubes, we have 
( r ° + Ar)/R r < 1/2 in (3.31) and R'/{&r - r a ) <1/2 in (3.36). For a 
required accuracy of t, we choose p = Eogj tj. 

Let's analyse the computational complexity of the algorithm, The total 
number of the cubes at all level!! of the subdivision is 


8 ° + 8 1 + ■ - ■ + S’ 1 




Step (1) takes 0{ jV) work to compute A" series expansions for all the A r 
particles. Summing these expansions to form for every atomic cube 
takes another O(jV) work. 

In step (2)j it takes one shifting and seven summations of expansions to 
compute d{^} for each of the cubes. This takes constant amount of work for 
a prespecified p. Since the total number of the cubes is of Order /V r the work 
to compute ${x) for all the Cubes in step (2) is of order 0(i\ r ). 

In order to compute *) for each of the cubes in stop (3), wc need to do 
one slid ting on the parent ■d'l'a?) and one shifting for each of the cube in its 
near-held, and then sum tile resulting expansions. Since the number of cubes 
in each cube's inter active-field is bounded by 567, it takes a constant amount 
Df work Co compute 10 (ar) for each of the cubes for a fixed p, Therefore step 
(3) takes 0{JV) work in total to compute $(x) for all the cubes. 

We have chosen the level n of the subdivision so that the number of par¬ 
ticles in the near-field of each atomic cube Ls bounded by a small constant, 
Ilius in step (4), for each of particles the direct computation on the inter¬ 
action with its near-field takes work of order (9(1). Evaluation of the local 
expansion at. the particle position takes again constant amount of work for a 

fixed p. T hns, step i,4) takes 0{N) work to compute potentials for all the A r 
particles. 

The overall complexity of the algorithm described in Figure 3.6, therefore, 
is the sum of those of step (1) through (4), that is, order O(N). 
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(1) Initial 'txpandon*; farwh atomic cube i t compute 

for i froB 1 to & n with i do begin 

compute for each of the particles in the cube t tha s*rie$ 
expansion fox the potential due to the per tic In by using 
Theorem 3 r 2,l; 

suit these expansions for particles in th* tuba t to fom 
i'J. 1 ■■ [xj for the cube t, 
end. 

{2) Upward-path; for each of the tubes i at level i of the spatial refinement, 
compute in an upward manner. 

for 1 from n - 1 to 0 with et«p —1 do begin 
for a from 1 to 9 ; with step 1 do begin 

shift ^ l+l (r) of cube i's bight subcubes to the center of 
the cube i by using Lemma 3.2,1] 

Sum the shifted (x } of cube t"^* eight subcubes to form 

d^fx) for the cube i . 

end. 

end. 

(3) Downward.path; for each of the Cubes i at level L compute ^J(x) jn ,% 
downward manner. 

for i from 1 to n with step 1 do begin 

for i fren 1 to 9 ‘ with step 1 do begin 

(3b) shift o£ cube f*s parent cube to the center 

of the cube i, b^ using Lemma 3-2-3; 

(3b> shift iV?|:rj of the interactive-field to the canter of 
cube i , by us ing Lem ma 3.2.2, 

EUJL the rasulting expansions Of (3a> and (3b) to form 
for the cube i . 

end. 

end. 


Figure 3-f>- A sequential algorithm 
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[ 4 } Pin*[ evaluation; For «dl of particles, compute ttie potential at tbd parSicte. 

for particle p from 1 to N with step 1 do begin 

4a) evaluate at t*ie atonic cube i the particle p belongs 

to at the particle position; 

(4b) CQ-aputu directly interactions uitfi particle* in its M *r- 
Eiald and, in the cube i f 

add (4a> and C4h> ae the desired potential for the particle y, 

end. 


Figure 3.T: A sequential algorithm (con't) 


Chapter 4 

Numerical Verifications of the 
Three-dimensional Algorithm 


This chapiter numerically verifies the three-dimensional algorithm of Chap¬ 
ter 3 with respect to the achieved accuracy and spued of the algorithm. We 
first describe a sequential implementation of the algorithm. We discuss our 
choice of i 3-D tree data structure, and describe heuristic methods for in¬ 
creasing the efficiency of the implementation based on a. detailed analysis on 
the time complexity of the algorithm. We then present experimental results 
to demonstrate that the algorithm docs have linear growth and does compnte 
the forces and potentials to within any prespectfied tolerance up to machine 
precision. We also compare the speed of tile algorithm with that of a direct 
compnEation and find that for a required average accuracy of 10 -4 for poten¬ 
tial fields our implementation will he faster when there are more than 1,000 
particles. 


4.1 A Sequential Implementation 

4.1.1 3-D Tree Data Structure 

As described in the previous chapter,, the algorithm requires the data struc¬ 
ture used for the implementation to hierarchically decompose a three- 
dimensional space and to support operations such as insertion, deletion, and 
searching. Therefore a 3-D tree data structure is an obvious choice. 

A 3-D tree [knuSl] is a balanced eightfold-branching tree, in which each 
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kvc] of nodes corresponds to one level of the three-dimensional spatial de¬ 
composition, The 3-D tree 19 an extension of the two dimensional tree data 
structure discussed in Section 2.2 of Chapter 2. In the 2-D tree, a node 
and its four children correspond to a square and ita four a ubsquarag; in the 
3“D tree, a node and its eight children correspond to a cube and its eight 
subcubes- Note that the root of the 3-D tree corresponds to the original 
cube of Space under consideration, while the leaves of the tree correspond 
to atomic cubes. Since each node of the Lrcc corresponds to a cube iu the 
spatial decomposition, we regard the word “node* and the word “cube" ag 
interchangeable. From now on we will refer to a node as if we refer to the 
corresponding cube in three-dimensional space, and uge the phrase “the cen¬ 
ter of a node" instead of “the center the cube 1 '. Each node of the tree has 
pointers to its children, and pointers to its near-field and itg interactive-fidd, 
as defined before. The near-held and the interactive-field for each node could 
be computed at run time, but in practice this is too expensive, Li a long- 
term simulation, which iterates over many time steps, these fields would have 
to be recomputed again nod again. Therefore, we initially establish static 
pointers to each node's near-field and interactive-field, This approach Raves 
time but requires extra storage space for these pointers. 

Particles are initially Contained in leaf nodes of the tree according to their 
positions. After each iteration of a simulation, information about particles, 
such as positions, velocities, etc. is Updated- A particle is moved to a new 
node if it crosses boundaries of an atomic cube. Coefficients of expansions 
at various stages of the computation are stored in 3-D an ay a held by each 
node, The hierarchical clustering of expansions is done by walking up and 
down the tree. 

The level of spatial refinement is chosen approximately as Log^jV. Thus, 
the 3-D tree is about logg N levelg deep. 


4.1-2 Implementation Details 

In this section we discuss the implementation of the algorithm on a Symbolics 
LISP machine, and describe techniques for increasing the efficiency of the 
implementation. 

Speed is the major concern in our implementation, One way to measure 
the efficiency of our implementation is to compare it with an implementation 
of the direct computation. For the first few implementations of our algorithm 
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on the LISP machine, the code ran extremely slowly. For p — 3, the crossover 
point where the running time of onr method falls below the running time of 
the direct method was at about. 10,000 particEes. 

Performance monitoring revealed opportunities for both machine- 
independent and machine-dependent, optimizations. 

We discuss the machine-independent optimizations first. Shifting expan¬ 
sions due to interactive-field potentials is very expensive. This accounts 
for most of the hidden constant- in 0{N). For each node in the tree, we 
need to do 567 shaftings on expansions, since each node has -567 nodes in 
its interactive-field. We can reduce this cost by grouping together nodes in 
the interactive-field a More specifically, we replace eight child nodes by their 
common parent node (we call it a super-node) if all ei^ht nodes ar« in the 
interaction field of a single node. Using this heuristic, each node has only 
140 nodes in its interact]ve-field. Numerical experiments indicate that this 
modification speeds up the algorithm by a factor of about S. 

Another source of inefficiency in the initial implementation waa the re¬ 
dundancy in computing coefficients in shifting formulas. The repeated calcu¬ 
lation of factorials and permutations was removed by storing factorials and 
permutation* in a table. There is abo repeated calculation of some com¬ 
mon factors in the formulas. We simplified this part of the computation 
by extracting common factors in sum*,, and by canceling common factors 
appearing in successive stages of the computation. 

For the machine-dependent optimizations, we found that a major portion 
of the running time Is spent on manipulating coefficients of expansions in 
shiftings and evaluations. The bottleneck of the computation is the floating¬ 
point calculation and indexing of arrays storing coefficients of various expan¬ 
sions. The array reference dime was reduced by representing a 3-D array as 
a 2'D array of 1-D arrays, or as L-D array of 2-D arrays. This minimizes 
the array reference overhead since 1-D and 2-D arrays are better supported 
on the LISP machine^ Because of the extensive use of array reference, the 
above improvement was significant. Other machine-dependent optimisations 
included using tight loops in. frequently called procedures;, and avoiding the 
creation of new arrays in manipulating expansions whenever possible. 

The above optimizations greatly reduced both the time and storage re¬ 
quirements for the computation. The reduction, in storage in turn reduces 


L The idea is due to Rich Zippel. 
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the time due to disk paging. Overall!, the m^ine-independent optimizations 
reduced running time by a factor of S and the machine-dependent optimiza¬ 
tions provided an additional factor of 4, The running time crossover point 
with the direct, computation is now at about 1,000 for p = 3 (See Table 4.1), 

4.2 Experimental Verifications of the Algo¬ 
rithm 

4-2*1 Accuracy of the Algorithm 

In this section, we study the actual achieved accuracy of the algorithm, and 
compare it with, the theoretical prediction given in ChapLcr 3. We first test 
Our algorithm on the Pythagorean configuration of three bodies and observe 
how the error of the computation scales with p, Then we test the algorithm on 
two typical distribution models, the uniform distribution model and Plummer 
distribution model, each with 1,000 particles, and again determine bow the 
error varies with p, 


(a) Tile Pythagorean Configuration 

The Pythagorean configuration of three bodies was first Investigated bv 
C. Burrau in If) 13 [SP67], It is Pythagorean not only in the geometric sense 
but. also with respect to the masses. The sides of the triangle formed by the 
three bodies are S, 4, and 5 and the masses of the three bodies are also 3, 
4, and b, The configuration is such that the body with mass X is Situated 
at the vertex opposite to the side of Length r, where x is one of 3, 4, or 3. 
The initial configuration of the problem and the coordinate system used are 
shown in Figure 4.L Initially the three hodies are situated in the z = 0 plane 
and have speeds zero, consequently the motion Is planar. 

We use our algorithm to compute the accelerations of the three bodies 
and observe how error scales with p. The experimental resutls are given in 
3able 4.1 and plotted in Figure -1.2, We define the relative error in each of 
the accelerations ns 
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Figure 41: Initial ccnfiguration of the Pythagorean configuration of three bodies 


where o^fljdJstei is the calculated acceleration OH body i using our method, in 
single precision arithmetic, and is the actual acceleration on that body 
using the direct method of force computation In double precision arithmetic 
and is therefore accurate to the machine round-oif error of double precision 
arithmetic. 

In Fable 4.1, jg the maximum error of ail the relative errors in ac¬ 
celerations 

c - MaxlJft *<»> I 

c maj - -1 Safari,*/ i 

c rm# is the root-mean-square error 


*'H| J 



where ft = 3 ts the number of bodies in the test, and is the 

relative error in total potential energy. 
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p 

f -m4± 

Crma 

Ifll- critT^M I 

1 

0,5S 

0-46 

0-0082 

2 

0.44 

0.27 

0.0020 

3 

0.17 

0.11 

4-2 x 10- 4 

4 

0.073 

0.040 

5-7 x lO-' 

5 

0.040 

0.026 

2.7 X 10-* 

6 

0-010 

0.012 

8-0 x 10-* 

7 

0,(1036 

0-0055 

1-3 X 10-“ 

a 

0.0039 

0.OO53 

6.6 X 10-* 

9 

0.0019 

0-0012 

6.3 X 10’ 6 

m 

9,0 x hH 

5-7 x ld“ A 

3,5 x l0 -b 

Ll 

4.3 x 10-‘ 

2.S x 10^" 

1.1 X 10“° 

12 

2,0 x ltH 

“l-3 x 10'" 

“|.7x 10- T “ 

13 

9.3 x IQ -5 

6.0 x 10 -a 

1,5 x 10~ T 

14 

4-4 X Lij- 5 

2.9 x 10" 5 

" M x 10“ a 

15 

2.1 x 10 -3 

1,4 X I0 s “ 

6,5 x lt)-» 

16 

l.li x to -5 

6.6 x 10“* 

S.Tx 10- a 

17 

5.0 x ltr* 

3.2 x ID- 41 

5-5 x l0 -fl 

Id 

2.3 x fir^ - 

1.5 X 1C -15 

2.Q x ]<r a 

iy 

1.1 x 1.0'*- 

7,8 x 10 -7 

1.7 x I0' B 

20 

5.5 x 10- T 

4.0 x 10- v 

" 3.6 x 10- R 


Table 4.1: Accuracy test for the Pythagorean configuration of three bodies, p is 
the highest decree of harmonics letained in an expansion. 


Table 4-1 shows that the accuracy of the algorithrei is improved by ap¬ 
proximately a factor of 2 when p is incremented by 1 for p greater than 2. 
Consequently the dots in Figure 4.2 are distributed nearly on a Line, since 
the error axis is scaled logarithmically. 

Note that the near-field in this implementation has been defined so that 
the ratio appearing in the error bounds (3,31J and (3.34) of Chapter 3 is 
1/2. The factor of 2 decrease in errors lor each increase in p L? thus expected 
from the formal analysis. The results presented in Table 4,1 and plotted in 
Figure 4,2 match well with the predicted error bounds. 

The experimental results exhibit, two phenomena that warrant further 
discussion. First, the error £ r ^i lh - w . in individual accelerations does not dc- 
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Figure 4,2: Plot of the accuracy test for the Pythagorean configuration of three 
bodies p is the highest degree of harmonics retained in an expansion. 

crease nKuiotocically as p in.creases. Second, the calculated accelerations have 
QOOSe ro components in the ^-direction, which LS perpendicular to the plane 
of the three-body configuration. Below, we discuss how these phenomena 
arise, 

Nonmonotonic, dcr.iv.wt $/ tncnSMinj p 

The nonmonotonic decrease of for individual accelerations with 

increasing p is due to the fact Lhat a imltipole expansion is a mixed-sign 
Taylor series. The error due to truncating a rnultipole expansion is also a 
mixed-sign series. Consequently, the error from truncating the nnuitipole 
expansion does not necessarily decrease monobonicaUy when retaining more 
terms in ths multipole expansion. Nevertheless, the error in individual ac¬ 
celerations is always bounded bv („„ in Table 4.1, which. I&e the predicted 
error bounds, decreases mouotomcally as p increases. 

Nonzero z component /or accelerations 

The nonzero i component in our computation results from two kinds of 
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truncation: the primary (nmcatwn and the .secondary truncation. We note 
that, each term of the muitipole expansion in Lemma 3.2,2 of Chapter 3 
la approximated by a convergent Taylor series, which we wi El call term se¬ 
ries. In computing the milUipole expansion using Lemma 3,2,2, the primary 
truncation truncates the multipole expansion and the secondary truncation 
truncates each term series , 

In the muitipole expansion computed from Lemma 3.2.2, the coefficients 
in each dimension arc determined by parameters in alt three dimensions. If an 
infinite number oF terms were retained in computing both the term scries and 
the multipole expansion, then the z component of the muitipole expansion 
would he zero due to cancellation among nonzero z terms. Since oniv a finite 
number of terms are retained, the z component of the muitipole expansion 
is not completely canceled. 

Intuitively, w e might expect that retaining more and more terms in the 
term series would reduce in general the secondary truncation error in com¬ 
puting the muitipole expansion and in particular the error in 2 - direction. But 
the experimental results show that the intuition i? unfounded. The reason is 
that the muitipole expansion and the term series are mixed-sign series. The 
fact that each of: he terms in the muitipole expansion is better approximated 
by the term series does not necessarily guarantee that the resulting truncated 
muitipole expansion is more accurate. 


(b) Uniform Distribution of 1,000 Particles 

We verify the accuracy claim of our algorithm using a configuration in 
which particles are distributed uniformly. In this test, the system lias 1,000 
particles, each of which has mass 1/100(1, The results are shown in TabJe 4.2. 
The errors e mar . Ema*, and are defined as before. The results 

match well with the prediction. 


(c) Plummer Distribution of i,ono Particles 


L\c also test our algorithm to determine if the accuracy of the algorithm 
is sensitive to the distribution of particles. The Plummer distribution is a 
nonunifornt distribution with the density profile [Her86] 
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p 


®r™ 

^PCfF"-<FnJ—Cliff 

1 

n.56 

o.nc-i 

5.0 xlC - - 1 

2 

0,11 

0,016 

1.6 x 10" 4 

‘T 

o.o-ss 

0.0065 

4.5 x 10“®"' 

4 

0-0-24 

0-0026 

U x I0“ T 

-5 

0-016 

0,0014 

1,9 x 10"' 

fi 

0 0053 

7.0 x 10 -11 

19 x10 -r 

7 

0,0026 

0,3 x lO"- 1 

2.7 x IQ" 1 ' 

8 

0,00 IT 

l.o x i&- 4 

1,3 x l0“ T 

9 

M x ID” 1 " 

9.7 x 10"* 

~3.2 bTlO" 6 


Table 4.2 1 Accuracy test for the uniform mad el. p is the highest degree of 
harmonics retained in an expansion. 


which is spherical and isotropic. Al is the mass of the system and r D is the 
scale-length. 

In the test the initial configuration of the system has two clusters, each of 
which is a Plummer system with 500 particles, Two clusters have the same 
M and r<> and axe separated by Sr 0 , A two-dimensional projection of this 
configuration is shown in Figure 1-3, 

The experimental results are given in Table 4.3. Again, they match very 
well with the theoretical prediction. 

In summary, the accuracy of the algorithm is demonstrated in Table 4.1, 
Table 4.2, and Table 4-3, These indicate that our method can compute forces 
and potentials to within any prespecificd tolerance up to the machine round¬ 
off precision, which is about 7 decimal digits* in single precision arithmetic 
[ReCS5]. As we retain more and more terms in the multi pole expansions, 
the accuracy of our method improves, and when p - 20 the error of the 
computation is at the level of the machine round-off errors. 


4*2.2 Running Time of the Algorithm 

In this section, we test the speed of our algorithm and experimentally 
determine how the running time grows with the number of part ides. Wc also 
compare the results with those of a direct computation and determine the 
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Figure 4-3; Initial configure!on af two Plummer systems 
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0,01166 
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6 
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Table 4.3; Accuracy test for the Plummer model, p is the highest degree of 
harmonies retained in an expansion, 
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Number of 
Particles 

Running Time (sec.) 

Speed-up 

M Lt It tpo le-ex p ansion 

Direct 

64 

12.60 

2 .ao 

0.1S 

130 

22.73 

9-08 

0.40 

256 

si.as 

30.04 

0.76 

513 

409.05 

156-34 

0.38- 

1034 

620.02 

616.62 

0.99 

20J6 

1050,65 

2468,72 

2.3 

4006 

5221.20 

935848 

L9 

a 192 

7203.37 

30433.92 

5-5 

1G3S4 

11411.63 

157735.60 

13.6 


Tabic 4.4: Speed test for the MufarpoEe-expansiori algorithm with p=3. Results 
of a direct computation are also presented for the purpose of comparison. The 
running time excludes paging time. 



Figure 4.4: Running time growth rate of the Multipole-expsnsion algorithm 
plotted against that of the direct computation. 
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running lime crossover point. In the test, particles are initially distributed 
uniformly within a spherical region. The space under consideration is the 
Cube of Spare containing t he spherical region. 

The results of the test on the speed of the algorithm are given in Table 4.4, 
ind aro plotted in Figure 4,4, The running time growth rate of the direct 
computation is also plotted for comparison' In the plot, both the running- 
time axis and the number-of-partieles axis are scaled logarithmically. The 
parameter p of the algorithm is chosen as 3 and the calculated potentials are 
accurate in average, to about ID"' 4 , 

figure 4,4 shows that the running time of our sequential implementation 
of the algorithm grows linearly with the number of particles. The jumps in 
the curve when the number of particles arc 500 and 4.000 axe due to the 
overhead of maintaining the tree structure when the tree grows one level 
deeper. The height of the tree grows logarithmically with the number of 
particles to enforce the constraint that the number of particles in each atomic 
cube is bounded by a predetermined constant. 

Tile running time Crossover point of our algorithm with the direct com- 
pul a. l* m is AC about 1,000 particles. 1 his means that for a required average 
accuracy of IQ " 4 for the potentials, our sequential algorithm is faster than 
direct computation when there are more than L G DO particles. 


4.3 Discussion 

We have numerically verified the three-dimensional algorithm with respect 
to the achieved accuracy and speed of the algorithm in the previous sections. 
Tn this section, we discuss some additional issues, concerning the use of our 
algorithm, 

fa) Tree Overhead 

We notti the overhead of maintaining the 3-D tree data structure in Sec¬ 
tion 4.2.2, The tree has a complicated pattern of links for the near-fields and 
the interactive-ft skis at each of the nodes. Much of the computation time is 
-spent on tracing these pointers, When the tree is not fully loaded, this part 
of the overhead will dominate, This accounts for the jumps in the curve in 
Figure 4,4, 
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(b) Distribution of Particles 

We have assumed in Chapter 3 that the distribution of particles is rel¬ 
atively uniform within the region under consideration. For very clumpy 
distributions* the performance of the algorithm will degenerate, ff we still 
construct the decomposition tree with height of L,l°Ss^ r Jf step (4) of the 
algorithm would take more than 0[N) time to compute all the near-field 
interactions. Otherwise* to maintain the conditio]! that- the number of parti¬ 
cles in cacti atomic cube is bounded by a predetermined constant* we would 
have to construct the tree with height of O(A’) in the worst case. Then, 
step (2) and step (3} of the algorithm in Figure 3.7 of Chapter 3 would take 
time exponential in N to compute all the multipole expansions. One way to 
improve the efficiency is to dynamically prune empty tree branches, that is, 
if no particles are founded in a branch, the branch will he pruned from the 
tree, 

ft) rj7j.L>:'cf of p 

We have demonstrated in Section 4,2,1 that the accuracy of our method 
improves as we retain more and more terms in the multipeds expansions. 
How big should the parameter p he in practice? This question depends Oil 
the specific nature of a problem. The goal of a numerical simulation is uot 
always “accuracy 1 ’ in a strictly mathematical sense* but rather “fidelity" to 
the underlying physics in a setise that is looser and more pragmatic [Ps86l. 
Often, as ia galaxy simulations, the conserved system quantities such as total 
energy and total momentum are of more interest, SO monitoring Oil Lliese 
quantities will better reflect how good an algorithm is. Sometime* we only 
want to look at the statistical behavior of a system. In such context, some 
types of errors arc much more tolerable than others, Another reason is that 
the error introduced by the discrete integration time step of a simulation is 
itself comparable with that of truncation [App35]. Therefore choosing small 
p often suffices. This will greatly reduce the constant factor in our algorithm. 


Chapter 5 

A Parallel Algorithm 


5,1 Introduction 

It is dear that the A-body method of Chapter 3 tan talee significant ad¬ 
vantage of patalielism. With a massively parallel computer, where we can. 
allocate a separate processing element for each particle, we can compute from, 
expansions the potentials for all of the particles in one parallel step. We can 
also propagate values up and down the spatial-decomposition tree, as in steps 
(2) and (3) of the algorithm in Figure 3.G, in parallel, generating expansions 
for all the nodes at a given love] in one parallel step. The depth of the tree 
grows as log jV, so a complete propagation will require time at least on the 
order of log jV, 

In this chapter, we present an extremely fast, implementation of the A r - 
body code -a p&ialEel implementation af the algorithm, whose measured run¬ 
ning time does indeed grow as 0(fo £ jV). The code runs on the Connection 
Machine, which is a massively parallel SIMD computer, consisting of up to 
65,536 processors, each with its own local memory, connected in a commu¬ 
nications network. On the Connection Machine model CM-2 that we have 
been using, each processor is a one-bit ALU with floating-point unit and 64K 
bits of local memory. 

in designing a practical algorithm for a large parallel machine, there are 
two issues that must he addressed in addition to the issue of deciding which 
steps should be performed concurrently. For many large parallel computa¬ 
tions, local computations at each processing element are relatively cheap, 
while communications among processing elements are expensive. The bot- 
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tScneck of the computation i& t-lx-e interprocessor communication, Thus, an 
efficient program must be designed to minimize both the need for mterpro- 
ceasor communication And the contention generated by whatever communi¬ 
cation is required. For the parallel jV-body code, we exploit the locality of 
the algorithm to reduce the need lor communication. In addition we exploit 
the regularity in the communication patterns of the algorithm in order to 
substantially reduce contention for the underlying coil'uimuication resource. 

5,2 Why a Parallel Computer? 

In the tree walking of our sequential algorithms computations on expansions 
could be done concurrently within each level of the tree. Tremendous speed¬ 
up can be expected if we exploit this parallelism. The question is how to map 
our problem onto a parallel machine, either a machine specially designed 
for this- problem or a commercially available one, such as the Connection 
Machine, 

Let ua take a dose look at the sequential algorithm we developed in Chap¬ 
ter 15. First, the tree computation has regularity over all nodes at each level of 
the tree. A SIMD machine suffices for computation of thin pattern. Second, 
the Lree computation has concurrency. Computation at each of the nodes 
within a single level of the tree can proceed at the same time, Third, the 
tree computation has spatial locality. In the upward-path of tree walking, 
each of the nodes talks only to iLs parent node. In the downward-path of 
tree walking, each of the nodes talks not only vertically to its parent, but 
algo laterally to those in its interactive-field. The lateral communication with 
a node's interactive-field is local to a Subtree rooted the node’s predecessor 
four levels Up in the tree. Depending on how we construct the tree on a 
parallel machine, we might be able to exploit this locality. Lastly, the tree 
computation has dependencies among levels of the tree. More specifically, 
computation at one level of the tree cannot proceed unless the computation 
one level up or down is finished. This determines that the optimal perfor¬ 
mance we can achieve from the tree computation is proportional to the height 
of the tree, that is, 0{Log A T ), where N ss the number of leaves in the tree. 

We choose the Connection Machine to implement, a parallel version of the 
algorithm developed in Chapter 3. 
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5,3 3-D Tree on the Parallel Computer 

We use a 3-D tree as our data structure, a a in the sequential implementation. 
The goal of the algorithm design is to distribute the component* of the 3-D 
tree structure among many processors so that the concurrency in the tree 
computation can be nuximumly exploited. 

Wc allocate a processor to each node of the tree. The upward-path and 
downward-path of the tree walking of the algorithm require that each node 
in the tree has links to it* parent els well as to its children. Explicitly storing 
these links would be expensive, since each node has one parent and eight 
children. However, since the tree has a fixed branching factor, a. regular 
allocation of processors for the tree will impose an implicit relationship among 
the processor addresses, as long els the tree is allocated within a well-defined 
region of processor*. This scheme is called the address-induced representation 
of the tree structure |Hil85] [Chr84j . 

More precisely, we allocate consecutive processor* to nodes, starting at 
the root of the Lroc. We first, make a sequential breadth-first scan of the tree 
to be constructed, and label each node as we encounter it, starting from 0 
at the root. Then we assign to every node a processor (or a fixed number 
of processors) whose address is the breadth-first scan label of the node, We 
therefore have a formula to induce the parent-child relationship for every 
processor in the tree. Given the address t of a processor, we know that its 
parent processor ha* address of [(*' “ l)/SJ t and its child processors have 
addresses from Si +1 to & + 8. By representing the tree in this way, we avoid 
Having to store pointers at every tree node. However, the saving in storage 
is achieved at the cost of computing the tree structure every time we trace 
pointers. 

We also allocate a processor to each of the particles in a test. \ Each leaf 
node of the tree remembers addresses of those particles that belong to the 
node spatially, so that those particle* can be accessed simply by referring 
to the leaf node during the computation of its near-field. Similarly those 
particles also remember the leaf node for the initial expansion stage of the 
computation. This representation of particle-leaf-node links ha* an additional 
advantage. After every time step of the simulation, the tree may be updated 
for particle* that cnave to new culls. This process is easy, since only particic- 

1Jt “ Possible that Eh* allocation nftbc tree «v«rl*pa. with that of the parities 
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leaf-node links need to be updated, 


5,4 A Parallel Algorithm 

In this section, we will describe a parallel implementation of the algorithm 
we developed in Chapter 3. The basic pattern? of communication include 
passing and combining data among processors vertically and laLerally with 
respect to the embedded tree structure. The description of the algorithm is 
summarized in Figure 5.1. Notations are used in the same way as that in the 
sequential algorithm of Figure 3,6. The number of levels iu the tree is n. 

In Step {!) of the algorithm, each node of the tree is mapped to a proces¬ 
sor, Since the tree only need? to be built once* the sequential construction is 
not a serious draw back as long as we amortize its cost over many time frames 
iu a simulation. 

Step (2) finds for every node of the tree its near-field and interact]ve- 
iicld nodus. This could be done at run time, since the addresses for these 
nodes take a fair amount of memory spaces- But this means that we have 
to find these nodes every time we want to access them. We know that there 
are SI nodes in a models near-field, and 567 nodes (using heuristics this can 
he reduced to 140 nodes) in each interactive-field (we ignore the boundary 
conditions). Experiments have show r n that the calculation of these fields i& 
very expensive. So we precompute these fields and store them. Step (3) 
inserts all the particles into the tree. 

Step (4) initially expands potential? for all particles and forms the ex¬ 
pansions 4> for leaf nodes of the tree. An expansion for the potential of a 
particle is valid outside the leaf node the particle belongs to* and has the 
center of the node as the reference point. All expansions of a single leaf node 
are added together. This is done all in parallel. (<&*? and ^P : s arc expansions 
as defined in Section 3.3* and are represented as arrays of parallel variables.) 

Step (&) implements the upward path of the tree walking, which computes 
■t for every node of the tree. The computation proceeds in this way: nodes at 
one level of the tree in parallel shift the 4's to the centers of their respective 
parents, combine the shifted according to whether they have the same 
parents, and send the results to the parents. 

Step (6) walks down the tree and compute? for every node of the tree. 
Every node in parallel fetches $ at its parent node, and shifts it to its center. 
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(1) Tree embedding: allocate one processor for tvery tree node, starting at the 

root. 


allocate processor 0 for the root; 
for i from 1 to r do h^giti 

for j from l to S' do begin 

allocate processor 8<S i ” 3 -l)/7 + j tor the ;th to the 
leftmost mode at le^al i of the tree, 
end. 
end* 


(2) Find for each of nodes its near-field and in ter active-field nodse; 

for cli nodes of the tree in parallel do begin 

each BCde finds Its near-field and interactive-field nodes 
&nd stores the addresses. 

end. 

(3} Insert A r particles: 

fort frofc 1 to N do hffgin 

Insert particle i into the cme, 
end. 

(1) Initial expansion at ley nodes: 

for n(/ particles tn parallel do begin 

•compute expansion <& for the potehtial due to e iC h of particles 
relative to carter of the leaf node the particle belongs to; 
4d,d together those <fr J a due to particles of the same leaf hpde 
and fora -h for that leaf node. 

end. 


Figure 5.1: A parallel algorithm 
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£5) Upward-patJi of tree wjdluEvg: 

for i from n doun to 1 do begin 

for ail nodes at; l*v*l i in pamttet do begin 
it each node 

shift the $ to the center of its parent node; 
sura the resulted expansions of those uho have the 
same parent node and form $ for that parent node. 

end. 

end. 

(&) Downward-path of tree walking; 

for j fron 1 to n do begin 

for ttlj nodeo at level i m purit/ii f do begin 
at each node dl 

Cfia) shift tjj at <Jl J s parent node to <11 "s center; 

{fib) far node d'2 in dl 's interactive-?{aid do begin 
shift <$■ at <i2 to dl's center; 
end. 

add the shifted together, 
add ffia) and {fit) to form 4 1 for node dl. 

end. 

end. 

{7) Final evaluation; compute near-field and far-field inLnpactions for e&ch of 
particles. 

for «U particles in jjcjmt^cf do begin 
at each particle 

{7a> evaluate at the particle position the 'S of the leaf 
node the particle belongs to; 

{7bj compute directly potentials due to its near^field. 
add {7a) and {7b) to form the desired potential. 

end. 


Figure 5.2: A paraM algorithm (eon't) 
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This is done level by level, as in Step (5), 

Finally, step (7) finishes the computation, Every particle evaluates in 
parallel the corresponding tf’* at the particle position- This part of the 
potential is due to the interaction with its far-fidd, Then every particle 
computes in parallel the near-field interactions directly. The sJm of the 
far-field and near-field interactions is the desired potential at this particle 
position. 

We have seen from the above description that computation at a given 
level of the tree Ln steps (4)-(7) talus one parallel step. Thus, step (4) 
and step (7) each take constant time. Since the tree hm depth of log N\ 
and computations at different levels of the tree have dependencies among 
them, a complete propagation in step (5) and step (6) takes O(log jY) time. 
Therefore, the time complexity of the parallel algorithm, excluding the initial 
set-up time for the tree and communication costs, is 0(\og N), 

Let us now look at the memory usage in the computation. When we 
retain in an expansion only the terms of spherical harmonics with degree 3ftss 
than four, the total number of terms in an expansion is 20. As verified by the 
experimental results from the sequential implementation, this guarantees iu 
average a four decimal-digit accuracy in potentials. Each of the coefficients 
.XL an expansion is a 152-blt Single precision floating-point number. We have 
three different arrays of coefficients for each node of the tree in the computa¬ 
tion, Thus expansions alone take about 2R bits of memory at each of nodes. 
Storing the interactive-field and near-field takes another 4K hits of memory. 
Wc also need some stack space for intermediate computation. Therefore, 
about 1DK bits of local memory at each node suffice for our purpose. In case 
a machine has insufficient amount of local memory per processor, we can 
remedy this by simulating a node of the tree with several processors. In such 
a case, however, the cost of the communication overhead would be high. 

5.5 Experimental Results 

We have implemented the parallel algorithm, on the Connection Machine. 
The experimental results are summarised in Table 5.L. The running time 
growth rate of the implementation is plotted in Figure 5,3, along with that 
of the sequential implementation. The results experimentally verify that the 
parallel implementation of the algorithm on the Connection Machine scales 
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Number of 
Particles 

Running Time (sec.) 

Speed-up 

Lisp Machine 

Connection Machine 

64 

12.80 

9.96 

1-4 

128 

23-73 

14,09 

1.3 

256 

&i .as 

23.72 

2-2 

513 

499.S5 

33.66 

12 

1024 

620.92 

51.75 

12 

2045 

1650.65 

00.05 

17 

4096 

5221.29 

71.83 

73 

SI 92 

7203.37 

79-63 

90 

16354 

11411.08 

94.17 

121 


Table 5.1: Experimental results an the Connection Machine. The running time 
excludes paging time. 



Figure 5.:}: Running time growth rate of the parallel algorithm, with that C>( the 
sequential implementation 
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logarithmically in the total number of particle? in the simulation, tven when 
we take the communication into the consideration. We find that the parnlle] 
algorithm rum. faster than the sequential algorithm, even on small exarm 
pies. Compared with the sequential implementation, the parallel algorithm 
exhibits a. speed-up factor of about 10 for AT* 1,000 and a speed-up factor of 
about 100 for A r = 10,CH>0 J . 

The complexity issue of the parallel algorithm, which is complicated by 
the need for communication, will be discussed in more detail in the following 
sect ion „ 


5.6 Communication Patterns 


We have discussed the complexity issue due to the floating-point compu¬ 
tation in Section 5.4. Our experiments have shown that the floating-point 
computation takes about 20% of the total running time and the interproces- 
sor communication consumes a major portion of the rest 3 . 

A careful study at the communication patterns reveals that there are 
two major kinds of communication going on in the computation. The first 
kind of communication occurs between adjacent levels of the tree. We call 
this vertical wmmnmcatwn . The second kind of communication occurs 
between nodes and their interactive-fields and new-fields, and Cakes place 
within a single level of the tree. We call this lateral communication*. The 
experimental results show that, when the tree is averagely loaded, the lateral 
communication constitutes about 20% of the total running time, and the 
vertical communication accounts for another 25%. In the fallowing sections, 
we describe various optimizations we have made to reduce the communication 
overhead. 


J The parallel compulation has about 200 million floating-point escalations in a test of 
1,000 particles, and has about TWO million floating-point calculations in a. test ai 10,000 
particles. 

3 Tlie Jesuits aw phtain-ed on the Connect^ Machine max the meLtrimr orocram 
provided by Shawn Mcirnn. 


from SMttou 4.1.2 that * ^i-node if * node firmed by greppLag on 
interactive-field nodes. Communicating with a super-node in the lateral eemnuimca- 
t]o*i involve* two adjacent levels of the tree. Although this situation is ainular to the 
vertical mmmaiiit*(inn, for the snkc of simplicity we will consider it to be the lateral 
eo&iiruiq-iir_ation. 
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5.6*1 Reducing Communication Bottlormcks 

The lateral communication required for computing interactions with near- 
fields and interactive-fields is one of the most expensive parts of the parallel 
computation. The cost of the lateral communication is due largely to mes¬ 
sage collisions in which many processors try to access a single processor at 
the same time. Since processor access on the Connection Machine i.s handled 
in an exclusive reading and exclusive writing way, the time of a successful 
message routing is proportional to the maximum number of message colli¬ 
sions. 

We tried to ease the communication bottleneck by reducing the number 
of co 11 is ion a in the lateral communication. We know that each node has 14(1 
nodes in its interactive-hold, and those nodes are represented as. a list of 
addresses. A close look shows that many processors often have the same 
node as their interactive-held! node at the same time, due to the way the 
anteractivc-ficld list is constructed. Qur algorithm, therefore, randomizes 
the order in which one’s interactive-field nodes nodes are accessed, in the 
hope that two processors are less likely to access a single processor at the 
same time. As a result, the randomization improves the performance oF the 
algorithm. 

5.6.2 Localising Communication 

In computing the interaction with a node’s interactive-field, we need to use 
coefficients of expansions stored in an interact] ve- fid d no-de again and again. 
Instead of fetching these every time we use them, we fetch them once and 
store them on a temporary local stack. This can reduce the traffic by a factor 
of b. when computing expansions whose highest degree is three. 

5.6.5 Combining and Delegating Messages 

The Connection Machine router has fairly good performance when the net¬ 
work is averagely loaded, hut. the routing time scales up as the number of 
collisions ill the rooting process. Collisions are mostly due to Concurrent read 
or concurrent write to a single processor. To reduce the collisions, we can use 
the scan operation, which does tile segmented parallel prefix computation on 
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Lbc Connection Machine 5 . To resolve collision due to the concurrent reart, 
we fim do an exclusive read, and then do a seam-copy operation. In case of 
the Concurrent write combined with addition, we da a scan-* and then an 
exclusive write instead. If elements of a segment to be scanned are not in 
consecutive processors., in order to apply the scan operator we have to first 
project the elements to a new segment so that the new segment has elements 
in consecutive processors. This extra projection pays off when there are more 
than thirty processors in a segment. 



Figure 5.4: Concurrent write 


In our implementation of the algorithm, the final direct interactions with 
a node’s near-held often gives rise to IQ-20 collisions. Therefore we use 
a different method to reduce the collisions. The idea is to structure the 
message ranting through a tree that distributes the message collisions over 
rnarty intermediate nodes. This tree-like routing allows many of the collisions 
to be handled in parallel, thereby reducing the total time delay due to the 
collisions. 


Consider an example in which 16 processors access a single processor for 
concurrent write (Figure 5,4). Using the tree routing scheme outlined above. 


"■ C i viti 


* bLn3j y op^sinr ♦ and a segmenu of sequence of Aments 

■pn, ihc parallel prefix operation computes el * r 3 , ..., jji * it? + . ., Jn 


, nr i r — ■—-» |- l vi'ijl uuivUI g^liijJUiiAa J. |_ ,, ,£ [ n X J , „ P| X j * Xj + ■ m J m 

in <9(Log(m)) i.iiik [HJSfl]. In particular, Lbr-ra are seart-d- and jcaii-e»pj operatjewm The 
^esn operation in very efficiently implemented on the Connection Machine. For simple 
binary operation* such as -I- and copy, Lt takes the same order of time aa a renting eyrie 
dues, which is the fundamental unit fnr time measurement, and Is therefore considered to 
take constant amount of tinte. 
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Figure fi.fi: Combining messages 



Figure 5,6: Delegating messages 
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we can use a two-level tree to combine every [our of the writes together, and 
then combine every four of the resulting writes (Figure 5,5). Even though 
there arc 20 total collisions in this tree, the time delay is only proportional 
to S collisions because the nodes at the intermediate Jewel of the tree handle 
their collisions in parallel. This process is called rntpsage-^mbininc/. The 
dud process, mesa^e-dclcgating, reduces collisions due to concurrent read, 
as illustrated tn Figure 5-6- 

In general, suppose there are N collisions in total. Each stage of combin¬ 
ing or delegating reduces the number of messages by a factor of m. Therefore 
the number of stages is ]og m N, What h the optimal value of m such that 
the function 

RaUti ng-ti m e(m, N ) = m log^ (jV) 

]« minimal? Using elementary calculus, we find that the optimal m = e, 
where e is the base of the natural logarithm 2.71 Of course, we must 
actually choose an integral value for m (2 or 3). 

Message-combining and message-delegating have been very effective, and 
in practice have given a factor of 2 improvement, in speed, 

5.7 Discussion 

In this section, we discuss trade-offs we made in our parallel implementation, 
and suggest alternatives to this implementation. 


5.7,1 Grid Representation 

We have used an address-induced scheme to construct the tree on the Con¬ 
nection Machine. This mapping scheme has the merit of simplicity, ft also 
Saws memory space, by avoiding explicitly storing pointers in the"tree. As 
a result of this consecutive mapping scheme, only a subset of processors are 
actrve at a Lime, since the computation proceeds level by level in the tree. 

Ibe alternative to the address-induced scheme is to use the Connection 
Machine grid-like communication network called the NEWS grid. This struc¬ 
ture provides fast communication, for local or highly structured communica¬ 
tion patterns [Hi 185). Every processor is aligned a grid coordinate and can 
be addressed by specifying this coordinate. The tree can he embedded in the 
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Fiji]re 5-7- Superimposed mapping,. Each square irt the grid represents a pro¬ 
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Figure 5-8: Won-superimposed mapping. Each square in the grid represents a 

processor, and each number represents a node in the tree. 
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network in two different ways. One way ] B to superimpose levels of the tree 
so dial nodes at one level may share processors with those at other levels, as 
illustrated in figure 5.7. As a result., some processors require more memory 
than Others do, This scheme makes the processor utilization at most 100%. 
However, since all processors on the Connection Machine must have the same 
amount of local memory, the nonuniform memory requirement across proces¬ 
sor wastes memory in some processors. The alternative is to unfold levels 


of the tree and to assign to every node a processor, as in Figure 5, 0 , X u llS 
makes the memory utilization at most 109% but wastes some processors, 

By using the NEWS grid, we can take advantage of the fast grid touting 
for highly Structured communication patterns, such as the communication 
with interactive-fields discussed below. 


5.7.2 Exploiting Regularities in Communication Pat¬ 
terns 

To further reduce the number of collisions discussed in Section G.fi.l, Alan 
Rutteriber# has suggested a scheme to exploit the regularities in the commu 
nication patterns. We have not implemented this, so we do not know how 
much it decreases the running time of the algorithm. To keep the discussion 
simple, we present the scheme in two dimensions, The extension to three 
dimensions is straightforward. 

We classify Squares of the spatial decomposition into Four classes. Suppose 
ill the squares of smallest size of Figure 5,9 are at level l. We first group 
every four squares that share a common parent at level f-1. Then we classify 
-he four squares of v. group as class 1, 2 ,3, and 4 respectively according to 
the relative position of a square in the group, as labeled in the Figure 5.9, 

We find that for every group there is spatial symmetry among the 
interactive-holds of its fonr squares. In Figure 5.10, square o' is an interactive- 
field square of square a- A clockwise rotation of 90 degrees of the square 
f with respect to the point O results in another square 4\ which is an 
interactive.field square of square and so on. As a result, of the spatial 
symmetry, the four squares a\ V, t', and -tf are in different classes, as shown 
m Figure 5.10. This symmetry property can be- used to completely eliminate 
message collisions in the communication with interactive-field squares. The 
idea is as follows-: each of four squares a b, c, and d interacts with Squares 
a i i respectively at the same time so that the communication 


CHAPTER 5, A PARALLEL ALGORITHM 


60 









1 

2 

I 

2 



A 

3 

4 

3 



1 

2 

1 

S 



A 

4 

4 

3 









FLgism p.Q: Classification of squares 



1 - 

a’ 










3 



i 

a 

2 

if 





4 O 
d 

i 

e 



1 

d' 










4 

£ ' 



Figure 5- 10: Symmetric communication pattern 
























CHAPTER 5. A PARALLEL ALGORITHM 


61 


pit-tern is symmetric with respect to four squares; the same is true for all 
other groups. Since the four squares a', ^ and d? are all in different classes 

ami communication patterns of two different groups are identical except that 
they are shifted from each other by settle distance, none of the four squares 
a\ Iff d, and d' will be fetched by more than one square at a time. Thus, 
Lateral communication Can take place without collision*. 

In Section 4.1.2. we discussed super nodes in the interactive-fields. Since 
the super nodes are one lave] higher in the decomposition tree than those 
ordinary nodes in the interactive-field, interactions with a super node using 
the above scheme will give rise to collision*. However, the maximum number 
of interactions to a super node will not exceed 4 at a time. Therefore the 
collisions can be avoided by spreading the data to he fetched at a super node 
to its four children, and afterwards, access to the super node wilt be directed 
to the four children. 

fn the above discussion we also ignored the boundary case. We can intro¬ 
duce dummy squares outside the boundaries so that boundary squares staff 
preserve the symmetry property of interactive-fields. 


5.8 Conclusion 


We have described the parallel implementation of the three-dimensional jV- 
body algorithm that runs on the Connection Machine in order 0(log jV'J time 
in this chapter. Compared with the previous A'-body methods, our method 
has a demonstrated advantage in simulating Large number of particles. The 
parallel implementation achieves tremendous speedup in running time and 
computes the forces and potentials to within any prespedfkd tolerance up 
to machine precision. 

Combining the results of this chapter with the analytical results derived 
,n Chapter 3 and the experimental results of the sequential implementation in 
Chapter 4, we have presented a complete analysis of the three-dimensional 
JV-body algorithm both theoretically and experimentally. necAusc of the 
superior speed and accuracy of our algorithm, we expect that it will find 
applications m astrophysics, plasma physics, fluid dynamics, and molecular 
dynamics. Nevertheless, there are additional improvements that we believe 
will enhance the practicality of the algorithm and, therefore, are worth fur¬ 
ther research effort. 
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Wc h ave observeJ in Section 5,6 that local computation—mainly floating¬ 
point calculation at each processing ■clement of the parallel implementation— 
takes only about 20% of total running time- The rest, of the running time 
is mostly spent on communication among processing elements. The results 
indicate that the bottleneck is the intcrproccssur communieatiod. Therefore, 
a:: efficient implementation of the algorithm should explore the communica¬ 
tion requirements of the algorithm. 

In Section 5.T we described alternatives to the current parallel implemen¬ 
tation to improve the performance of the algorithm. Ely using the NEWS 
grid, we can exploit the localities of the algorithm and take advantage of the 
fast grid routing. The use of the h'EWS grid will aLso enable us to exploit the 
regularities in the literal communication of the algorithm. We expect that, 
the resulting implementation will ease the bottleneck of the mterprocessor 
communication - 
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